ABSTRACT


In order to deliver the promise of Moore’s Law to the end 
user, compilers must make decisions that are intimately tied 
to a specific target architecture. As engineers add archi- 
tectural features to increase performance, systems become 
harder to model, and thus, it becomes harder for a compiler 
to make effective decisions. 

Machine-learning techniques may be able to help compiler 
writers model modern architectures. Because learning tech- 
niques can effectively make sense of high dimensional spaces, 
they can be a valuable tool for clarifying and discerning 
complex decision boundaries. In our work we focus on loop 
unrolling, a well-known optimization for exposing instruc- 
tion level parallelism. Using the Open Research Compiler 
as a testbed, we demonstrate how one can use supervised 
learning techniques to model the appropriateness of loop 
unrolling. 

We use more than 1,100 loops — drawn from 46 bench- 
marks — to train a simple learning algorithm to recognize 
when loop unrolling is advantageous. The resulting clas- 
sifier can predict with 88% accuracy whether a novel loop 
(i.e., one that was not in the training set) benefits from
loop unrolling. Furthermore, we can predict the optimal or 
nearly optimal unroll factor 74% of the time. We evaluate 
the ramifications of these prediction accuracies using the 
Open Research Compiler (ORC) and the Itanium® 2
architecture. The learned classifier yields a 6% speedup (over
ORC’s unrolling heuristic) for SPEC benchmarks, and a 7% 
speedup on the remainder of our benchmarks. Because the 
learning techniques we employ run very quickly, we were 
able to exhaustively determine the four most salient loop 
characteristics for determining when unrolling is beneficial. 


1. INTRODUCTION 


It is difficult to model modern computer architectures. 
Even earnest attempts to create accurate system-level mod- 
els can yield poor results. There is a good reason for this: 
the components in modern architectures are inextricably 
tied together. Modern compilers are also extremely compli- 
cated tools. Compiler writers have broken the difficult prob- 
lem of compilation into manageable phases that are solved 
in isolation. Due to the nature of the problem, it is not 
possible for a compiler writer to account for the interactions
between compiler phases. 

In addition, there are intricate interactions between the 
compiler and the underlying architecture. For instance, the 
configuration of the memory system affects the efficacy of 
the instruction scheduler, and vice versa, the instruction 
scheduler affects the performance of the memory system. 

Compiler writers rely on simple localized models to ab- 
stract away system complexities. For example, register al- 
locators are often written to ignore important interactions 
with the instruction scheduler and instead base decisions 
solely on simple models of the memory system. Without a 
reliable model upon which to base decisions, compiler writ- 
ers necessarily resort to trial-and-error heuristic tweaking to 
achieve a suitable performance. 

The goal of this research is to show that machine-learning 
techniques can be used to improve compiler heuristics. We 
train a compiler (with empirical data) to recognize when 
it is worthwhile to perform a particular optimization. In 
essence, we train the compiler to perform better. Learning 
techniques can find sense even in high-dimensional search 
spaces, and thus they can be effectively applied to compiler 
optimizations where the resulting performance is a function 
of several variables. 

We apply a simple machine-learning technique to the prob- 
lem of loop unrolling. We first try to determine when loop 
unrolling is beneficial. We show that nearest neighbor
classification, a simple and widely used learning technique, works
remarkably well. A nearest neighbor classifier can predict 
with 88% accuracy when a loop should be unrolled. 

We then extend this result by showing that we can train 
a nearest neighbor classifier to predict not only when a loop 
should be unrolled, but the factor by which it should be un- 
rolled. When we consider all unroll factors through eight, 
we can predict the optimal unroll factor 60% of the time, 
and the optimal or nearly optimal factor 74% of the time. 
We evaluate the implications of improved unrolling decisions 
using the Open Research Compiler and an Itanium® 2
architecture. With this infrastructure, the nearest neighbor
algorithm achieves a 6% speedup (over ORC’s heuristic) for 
the SPEC benchmarks and a 7% speedup on an assortment 
of synthetic benchmarks and kernels. 

While we are pleased with the performance results of our 
research, we are more excited about the prospects of using 
our work to reduce the complexity of compiler development; 
we show that it is possible to offload much compiler design to 
machine-learning algorithms. Apart from creating a training
data set by which an offline (i.e., performed at compiler
development time) learning algorithm can be trained, our 
technique requires very little tweaking on the part of the 
compiler writer. Furthermore, engineers can use learning 
techniques, as we do in this paper, to identify the most per- 
tinent characteristics of a system. In this capacity machine 
learning could prove to be an invaluable tool for systems 
engineers who are struggling to keep pace with complexity 
increases. 

The paper is organized as follows. The next section briefly 
states the contributions of this research. Section 3 describes 
the advantages and disadvantages of loop unrolling; it lists 
some important factors that one should consider when trying 
to determine whether unrolling a given loop will be desir- 
able. Section 4 discusses our approach and our infrastruc- 
ture. Section 5 describes the nearest neighbors technique, 
while Section 6 applies nearest neighbors to binary loop clas- 
sification. Section 7 describes experiments with multi-class 
classification. Section 8 relates our work to previous work, 
and we conclude in Section 9. 


2. CONTRIBUTIONS 


The novel aspects of our research are summarized here 
(please see Section 8 for a detailed comparison of our re- 
search and related work): 


• To the best of our knowledge, we are the first to use
  multi-class classification to improve compiler decisions.
  Many compiler decisions involve choosing between one
  of many options, not just making a binary choice.
  While other compiler researchers have employed learning
  techniques for binary problems, none have tried to
  solve harder multi-class problems.

• We are the first to show that nearest neighbor
  classification is a viable method for improving compiler
  decisions.

• We show that the nearest neighbor approach can isolate
  the four most important factors for predicting
  when unrolling is appropriate, and more generally, for
  predicting the best unroll factor.

• We show that learning techniques can improve the
  performance of a well-respected compiler targeting a
  modern architecture.


We have also publicly released the instrumentation library 
that we wrote and the raw loop data that we collected so 
other researchers can easily apply their own learning
techniques.


3. LOOP UNROLLING 


Loop unrolling is a well known transformation in which 
the loop body is replicated a number of times. Since the 
backward branch is needed only after executing the entire 
unrolled body, loop unrolling reduces overhead by decreas- 
ing the number of branch operations. This can be particu- 
larly important for architectures that have high branching 
overhead. However, loop unrolling is primarily used to en- 
able other optimizations, many of which target the memory 
system. For example, unrolling creates multiple static mem- 
ory instructions corresponding to dynamic executions of a 
single operation. After unrolling, these instructions can be
rescheduled to exploit memory locality. If the loop accesses
the same memory locations on consecutive iterations, many 
of these references can be eliminated altogether with scalar 
replacement. Another method to reduce memory traffic uti- 
lizes a wide memory bus to transfer multiple words with a 
single load or store operation. Unrolling is key to expos- 
ing adjacent memory references [5, 7] so that they can be 
merged into a single wide reference. 
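
The transformation itself can be illustrated with a small sketch. The Python below is only an illustration of the idea (a compiler applies it to its intermediate representation, not to source code), and the function names are hypothetical. The unrolled version replicates the body four times, so the loop test runs once per four elements, and a remainder loop handles trip counts that are not a multiple of four.

```python
def dot_rolled(a, b):
    """Reference loop: one multiply-add and one loop test per iteration."""
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def dot_unrolled_by_4(a, b):
    """Same computation with the loop body replicated four times."""
    n = len(a)
    total = 0.0
    main_end = n - (n % 4)            # largest multiple of 4 not exceeding n
    for i in range(0, main_end, 4):   # one loop test per four elements
        total += a[i] * b[i]
        total += a[i + 1] * b[i + 1]
        total += a[i + 2] * b[i + 2]
        total += a[i + 3] * b[i + 3]
    for i in range(main_end, n):      # remainder loop for leftover iterations
        total += a[i] * b[i]
    return total
```

The four statements in the unrolled body are the "multiple static instructions" that a scheduler could then reorder or merge.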

Arguably, the most important aspect of loop unrolling is 
its ability to expose instruction level parallelism (ILP) to the 
compiler. After unrolling, the compiler can reschedule the 
operations in the unrolled body to achieve overlap among 
iterations. Such a scheme was first used in the Bulldog 
compiler [6] and is still important in compiling for machines 
that support a high degree of ILP. Typically, unrolling is 
combined with other transformations that increase the size 
of the scheduling window. Examples include trace schedul- 
ing [6] and hyperblock formation [8]. These techniques are 
particularly useful in scheduling for loops that contain con- 
trol flow or function calls because of the difficulty these prob- 
lems present to software pipelining. 

Superficially, loop unrolling appears to be an optimization 
that is always beneficial. However, loop unrolling can impair 
performance in many cases. The following non-exhaustive 
list considers some possible drawbacks to loop unrolling: 


• The most acknowledged detriment of unrolling is that
  code expansion can degrade the performance of the
  instruction cache.

• Added scheduling freedom can result in an increase
  in the live ranges of variables, resulting in additional
  register pressure. Since memory spills and reloads are
  typically long latency operations, this can negate the
  benefits of unrolling.

• Control flow also complicates unrolling decisions. If
  the compiler cannot rule out that a loop may take
  an exit early, it will actually have to add control flow
  to the unrolled loop that may negate (or at the very
  least neutralize) the benefits of unrolling.

• Some compilers aggressively speculate on memory
  accesses. Execution time will increase if the scheduler
  chooses to speculatively hoist unrolled memory accesses
  that dynamically conflict.


Compilers are complex tools. It is nearly impossible to 
know what choices to make based on simple models and as- 
sumptions. The scheduler, the register allocator, and the 
underlying architecture interact in mysterious ways. The 
only way to truly know what will work is to empirically 
evaluate decisions. It is the goal of this research to use
empirical observations to train a compiler to make informed
decisions. 


4. METHODOLOGY AND INFRASTRUCTURE


This section briefly introduces supervised classification in 
terms of loop unrolling. A discussion of the infrastructure 
that we use to perform the experiments in this paper follows. 


Feature
-------
The language (C or Fortran).
The tripcount of the loop (-1 if unknown).
The estimated frequency of execution of the loop.
The loop nest level.
The maximum dependence height of the loop.
The average dependence height.
Number of operations.
The maximum height of memory dependencies.
The maximum height of control dependencies.
The number of parallel “computations” in loop.
The number of indirect memory references.
The number of induction variables.
Minimum num. iterations between memory loop-carried deps.
Minimum num. iterations between scalar loop-carried deps.
Number of calls.*
Number of floating point operations.*
Number of branches.*
Number of memory operations.*
Number of floating point memory operations.*
Number of distinct predicates used.
Number of hazards.*
Number of operands.*
Number of control speculation instructions.*
Number of data speculation instructions.*
Estimated cycles of critical path.*
Number of live ranges into the loop.*
Number of live ranges out of the loop.*
Number of uses in the loop.*
Number of defs in the loop.*
Number of definitions that reach the loop entrance.*
Number of definitions that reach the loop exit.*


Table 1: A subset of features used for loop classifica- 
tion. These characteristics are used to train the nearest 
neighbors classifier. The features that are marked with 
an asterisk are normalized by the number of operations 
in the loop. 


4.1 Our Approach: Supervised Learning 


The experiments conducted in this paper use an offline 
learning technique known as supervised learning. Though 
the learning algorithm is trained offline, the learned classi- 
fier can easily be incorporated into a compiler. Supervised 
learning is performed on a set of training examples. Each 
training example (x_i, y_i) is composed of a feature vector x_i
and a corresponding label y_i.

The feature vector contains measurable characteristics of 
the object under consideration. In our experiments, the 
feature vector contains loop characteristics such as the trip 
count of the loop, the number of operations in the loop body, 
the programming language the loop is written in, etc. We 
extract a feature vector for every unrollable loop in our suite 
of benchmarks. Table 1 shows a subset of the features that 
we extracted for the experiments in this paper. In total, we 
use 38 features in these experiments, but as we show later, 
using many fewer features works nearly as well. 

In addition to the feature vector, we also extract a train- 
ing label for each unrollable loop in our benchmark suite. 
The training label indicates which (mutually exclusive) op- 
timization is the best for each training example. For the 
experiments presented in this paper, labeling the data is 
relatively straightforward; we measure each loop using eight 
different unroll factors (1,2,...,8). The label for the loop 
is the unroll factor that yields the best performance. 
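
As a minimal sketch of this labeling step, assuming the per-factor measurements for one loop are available as a mapping from unroll factor to cycle count (the function name and the tie-breaking rule are assumptions, not something the text specifies):

```python
def best_unroll_factor(runtimes):
    """Return the unroll factor with the smallest measured runtime.

    `runtimes` maps an unroll factor (1..8) to a measured cycle count.
    Ties go to the smaller factor; this tie-break is an assumption.
    """
    return min(sorted(runtimes), key=lambda factor: runtimes[factor])
```

For example, a loop measured at 120, 100, 106, and 150 cycles for factors 1, 2, 4, and 8 would receive the label 2.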

To reiterate, for each loop — alternatively referred to as 
a training example — we have a vector of characteristics 
that describes the loop, and a label that indicates what the 
empirically found best action for the loop is. The task of
a classifier is to learn how best to map loop characteristics
(x_i) to the observed labels (y_i) using all the examples in the
training set.

Training a classifier usually involves finding a mapping 
from feature vectors to output labels so that the overall 
classification error is minimized on the training examples. 
The hope is that an adequately trained classifier will also be 
able to accurately discriminate novel examples (examples 
that were not in the training set). 


4.2 Computing the Accuracy 


The accuracy numbers presented in this paper are com- 
puted using a methodology known as leave-one-out cross- 
validation (LOOCV). The approach allows machine learning 
researchers to estimate the generalization ability of a learn- 
ing algorithm (i.e., how well new examples can be classified). 

LOOCV is an iterative process that runs N times,
where N is the size of the training data set. On each
iteration i, the algorithm removes the i-th example from the
training set, trains the classifier using the remaining N - 1
examples, and then sees how well the resulting classifier
categorizes the left-out example. The generalization accuracy
is then the number of correctly classified left-out examples
divided by the total size of the training set.
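
The procedure can be sketched in a few lines of Python. This is an illustrative sketch, not the code used for the experiments: it plugs in a 1-nearest-neighbor classifier with Euclidean distance, and the names `nearest_label` and `loocv_accuracy` are hypothetical.

```python
import math


def nearest_label(train, query):
    """1-NN: label of the training example closest to `query` (Euclidean)."""
    _, label = min(train, key=lambda example: math.dist(example[0], query))
    return label


def loocv_accuracy(examples):
    """Leave-one-out accuracy over a list of (feature_vector, label) pairs."""
    correct = 0
    for i, (x_i, y_i) in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]   # the remaining N - 1 examples
        if nearest_label(rest, x_i) == y_i:
            correct += 1
    return correct / len(examples)
```

For nearest neighbors, "training" on the remaining N - 1 examples amounts to leaving one entry out of the database, which is why the whole loop stays cheap.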

There are other methods available for estimating a classi- 
fier’s accuracy, but LOOCV is particularly appealing when 
the size of the training set is small — which ours is — be- 
cause the learning algorithm can be trained using nearly all 
the examples in the dataset. 

Section 5 describes the learning technique that we use to 
perform the experiments in this paper. The remainder of 
this section discusses our infrastructure and how we collect 
training labels. 


4.3 Compiler and Platform 


We used the Open Research Compiler (ORC v2.1) [10]— 
an open source research compiler that targets Itanium archi- 
tectures— to evaluate the benefits of applying learning to 
loop unrolling. ORC is a well-engineered compiler whose 
performance rivals commercial compilers. The experiments 
in this paper target a 1.3 GHz Itanium 2 server running 
Red Hat Linux Advanced Server 2.1. We use -O3 optimizations
for all experiments in the paper. We disable software
pipelining to focus on the loop unrolling heuristic, and we 
set the maximum unroll factor to eight. 


4.4 Loop Instrumentation 


Because this paper is concerned with loop optimizations, 
we instrumented ORC to measure the runtime of all inner- 
most loops. The instrumented code assigns a counter to 
every loop in the program. Immediately before execution 
reaches an innermost loop, the instrumentation code cap- 
tures the processor’s cycle counter and places it in the loop’s 
associated counter. When the loop exits, the cycle counter 
is again captured and the total running time of the loop is 
computed. 
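
The bookkeeping behind this scheme can be sketched as follows. This is only an analogy in Python: `time.perf_counter_ns` stands in for the processor's cycle counter, and the names `timed_loop` and `cumulative_ns` are hypothetical. The actual instrumentation is emitted by the compiler.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Loop identifier -> cumulative time spent in that loop (nanoseconds here,
# as a stand-in for cycles).
cumulative_ns = defaultdict(int)


@contextmanager
def timed_loop(loop_id):
    """Capture the counter on loop entry and again on loop exit."""
    start = time.perf_counter_ns()
    try:
        yield
    finally:
        cumulative_ns[loop_id] += time.perf_counter_ns() - start
```

Wrapping a loop in `with timed_loop("some_id"):` accumulates its running time across every execution of that loop.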

We invested a great deal of engineering effort minimizing 
the impact that the instrumentation code has on the exe- 
cution of the program. We initially inserted procedure calls 
to an instrumentation library that started and stopped the 
loop timers. This methodology proved to be extremely in- 
trusive since the caller-saved register allocator spilled many 
variables on each call to the instrumentation library. 


Our current loop instrumentor inserts assembly instruc- 
tions that start and stop the loop timers. This lightweight 
model allows the instruction scheduler to bundle instrumen- 
tation code with a loop’s prolog and epilog code. Further- 
more, the instrumentor does not significantly impact register 
usage. 

At all exit points in the program a call is made to our in- 
strumentation library’s finalize procedure. This procedure 
prints the cumulative running time of each loop in the pro- 
gram. This data is used to train the offline learning algo- 
rithms we use; the learning algorithm needs to know which 
loop optimization strategy was most beneficial for each loop, 
and thus these cycle counts form the basis of our labeled 
training data set. 

We realize that we cannot possibly measure loop runtimes 
without affecting the execution in some way. However, we 
have made an earnest — and we think successful — attempt 
at hiding the overhead of loop instrumentation. Neverthe- 
less, to further mitigate the impact of the instrumentation 
on the accuracy of the runtime numbers, we only use loops 
that are run for at least 50,000 cycles. This helps reduce 
the amount of noise in our training data set. For instance, 
were we to train with loops that are only run for a few thou- 
sand cycles, a loop that sits on the edge of an instruction 
cache boundary could introduce huge amounts of noise: a 
cache miss would comprise a significant portion of the total 
runtime of the loop. 

We run each benchmark 30 times for all unroll factors 
up to eight; an unroll factor of one corresponds to leaving 
the loop intact (rolled). For each loop we take the median 
runtime for a given unroll factor. 
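
A minimal sketch of this aggregation step, assuming the repeated measurements for one loop are stored as a mapping from unroll factor to a list of cycle counts (the data layout and function name are assumptions):

```python
from statistics import median


def median_runtimes(trials):
    """Collapse repeated measurements into one runtime per unroll factor.

    `trials` maps an unroll factor to a list of cycle counts, one per run.
    """
    return {factor: median(samples) for factor, samples in trials.items()}
```

The median is less sensitive than the mean to an occasional outlier run, such as one disturbed by other activity on the machine.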


4.5 Benchmarks Used 


We use the benchmarks listed in Table 2 to gauge the
efficacy of our approach¹. The benchmarks come from a variety
of benchmark suites and span three languages (C, Fortran,
and Fortran90). The table shows the number of loops that
each benchmark contributes to our training set. In some 
cases, only a small fraction of the loops in a benchmark are 
included in our training set. There are three main reasons 
for this: many of the loops are not unrollable², we only use
loops that are run for a minimum of 50,000 cycles, and we 
only use loops where the best unroll factor is measurably 
better (1.1x) than the average over all unroll factors. The 
latter two criteria are intended to filter out noisy examples 
that might prevent a learning algorithm from generalizing. 
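
The two noise filters can be sketched as a single predicate. This is a sketch under stated assumptions: the text does not say which measurement the 50,000-cycle threshold applies to, so the code applies it to the fastest variant, and it reads "measurably better (1.1x)" as the ratio of the average runtime to the best runtime. The names are hypothetical.

```python
MIN_CYCLES = 50_000    # minimum runtime stated in the text
MIN_ADVANTAGE = 1.1    # best factor must beat the average by this ratio


def keep_for_training(runtimes):
    """Decide whether a loop's measurements are clean enough to train on.

    `runtimes` maps an unroll factor to its median cycle count.
    """
    times = list(runtimes.values())
    best = min(times)
    average = sum(times) / len(times)
    runs_long_enough = best >= MIN_CYCLES        # assumption: fastest variant
    clearly_better = average / best >= MIN_ADVANTAGE
    return runs_long_enough and clearly_better
```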

There are many different classification techniques that one 
could choose to employ. The next section describes a simple 
technique that works well for a wide range of problems. 


5. NEAREST NEIGHBOR CLASSIFICATION


Nearest neighbor (NN) classification is an extremely in- 
tuitive learning technique. The idea of the algorithm is to 
construct a database of all (x_i, y_i) pairs in the training set.


¹ Please note that we have excluded some SPEC benchmarks. There
are two main reasons for this: some of the benchmarks did not compile 
properly with ORC and our loop instrumentor (the loops have to run 
correctly for all unroll factors), and we simply ran out of time. We are 
still adding benchmarks to our training set, and as with most learning 
approaches, our generalization accuracy should improve with the size 
of our training set. 

² ORC only unrolls single block inner-loops. In practice, this is not
overly restrictive as a hyperblock formation phase precedes the loop 
unrolling phase. 


Benchmark       Description
---------       -----------
008.espresso    Logic minimization.
022.li          Lisp interpreter.
052.alvinn      Neural network training.
099.go          Go-playing program.
101.tomcatv     Vectorized mesh generation.
124.m88ksim     Motorola 88100 simulator.
129.compress    Compression.
146.wave5       Maxwell’s equations.
164.gzip        Compression.
168.wupwise     Physics simulations.
171.swim        Shallow water modeling.
172.mgrid       Multi-grid solver.
173.applu       PDE solver.
175.vpr         Circuit placement and routing.
176.gcc         C compiler.
177.mesa        3-D graphics library.
178.galgel      Fluid dynamics.
179.art         Image recognition.
181.mcf         Combinatorial optimization.
183.equake      Seismic wave simulation.
187.facerec     Face recognition.
188.ammp        Computational chemistry.
189.lucas       Number theory.
197.parser      Word processing.
200.sixtrack    Particle accelerator simulation.
255.vortex      Object-oriented database.
256.bzip2       Compression.
300.twolf       Place and route simulator.
301.apsi        Meteorology simulator.
QCD             Quantum chromodynamics.
TRACK           Missile tracking.
BDNA            Molecular dynamics simulation.
OCEAN           2-D ocean simulation.
MG3D            Depth-migration code.
TRFD            Two-electron integral transformation.
linpack         Linear equation solver.
mmmul           Matrix-matrix multiply.
purdue bench    Synthetic parallel benchmarks.
vector          Test suite for vectorizing compilers.
whetstone       Synthetic benchmark.


Table 2: Benchmarks used. This table lists the bench- 
marks from which loop runtimes are extracted, as well 
as the number of loops contributed by each benchmark. 
We only use loops that ORC can unroll and whose opti- 
mal unroll factor is measurably better than the average 
(1.1x) over all unroll factors up to eight. Note that the 
purdue benchmark entry is actually comprised of nine 
separate benchmarks. 


A label (unroll factor) can be computed for a novel example 
simply by finding the nearest example in the database and 
using its label. This is a sensible approach for assigning loop 
unroll factors: the compiler should treat similar loops simi- 
larly. We use Euclidean distance as the similarity metric; the 
distance between database entry x; and a novel loop with 
feature vector Xnovel is ||(Xnovel — Xi)||. The feature vector 
is normalized to weigh all features equally; otherwise, fea- 
tures with large values such as loop tripcount would grossly 
outweigh small-valued features in the distance calculation. 
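
The lookup with normalization can be sketched as follows. The text does not say which normalization is used, so this sketch assumes min-max scaling of each feature to [0, 1]; all function names are hypothetical.

```python
import math


def fit_scaler(vectors):
    """Per-feature (minimum, range) so that each feature maps to [0, 1]."""
    columns = list(zip(*vectors))
    # A constant feature has range 0; use 1.0 to avoid dividing by zero.
    return [(min(c), (max(c) - min(c)) or 1.0) for c in columns]


def scale(vector, scaler):
    return [(v - low) / span for v, (low, span) in zip(vector, scaler)]


def build_database(examples):
    """Normalize every (feature_vector, label) pair in the training set."""
    scaler = fit_scaler([x for x, _ in examples])
    return [(scale(x, scaler), y) for x, y in examples], scaler


def predict(database, scaler, x_novel):
    """Label of the database entry nearest to x_novel after scaling."""
    query = scale(x_novel, scaler)
    _, label = min(database, key=lambda entry: math.dist(entry[0], query))
    return label
```

With two features (trip count, number of operations) and training loops ([100, 5], label 4) and ([10000, 50], label 1), a query of [9000, 6] is assigned label 4 after scaling. Without scaling, the trip count alone would dominate the distance and the query would be assigned label 1.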
The graph in Figure 1 visually depicts the operation of 
a nearest neighbor classifier on real loop data. Each of the 
points in the figure represents a loop from our suite of bench- 
marks. Because there are too many dimensions in the orig- 
inal feature space to graphically depict (equivalent to the 
number of features in Table 1), we have reduced the
dimensionality by projecting loops from the original feature space
(each of which is represented by a feature vector x_i)


Figure 1: Nearest neighbor classification. This figure 
highlights the salient features of the nearest neighbors 
algorithm. Each of the points in the figure corresponds 
to a 2-D projection of normalized loop features. The 
blue crosses correspond to those loops that should be un- 
rolled (according to empirical evaluation) and the black 
dots correspond to those that should not. By using this 
‘database’ of loops, nearest neighbors can quickly predict 
an action for a new loop. Here we query the database 
with loop q to determine that it should not be unrolled. 
To improve the visualization in two dimensions, this fig- 
ure only considers loops where unrolling either degrades 
or improves performance by over 30%. 


onto a plane³.

The nearest neighbors algorithm makes predictions for a 
new point based on the value of the point’s nearest neighbor 
in the database. In Figure 1, the query point q is nearest 
to a point that has been empirically identified as a loop 
that should not be unrolled. Therefore, the algorithm would 
predict that this loop remain rolled. 

Note that NN classification is trivial to train: we sim- 
ply have to populate a ‘database’ of examples. Though 
the training time of a classifier is not a paramount
concern (since training the classifier is done offline), the time it
takes for the resulting classifier to make predictions is im- 
portant (since this task will be performed by the compiler at 
compile time). NN classifies a new example by performing 
a linear scan of the examples in the training set. For small
training set sizes (such as ours), the lookup is extremely
fast⁴.


6. BINARY CLASSIFICATION 


Before considering the case in which we attempt to deter- 
mine the best unroll factor, it is instructive to first try to 
determine whether unrolling a given loop is beneficial. 


³ Note that the axes of the graph correspond to a linear combination
of the dimensions in the original feature space. 

⁴ With over 1100 examples in our database, the linear-time scan takes
less than 3ms. Lookup time is far outweighed by compiler fixed-point 
dataflow analyses. 

Figure 2: Nearest neighbor classification into many
classes. Even when projected from a high dimensional
space onto a plane, we see that loops that the compiler
should treat similarly cluster together. The dots, the
crosses, the circles, and the triangles represent loops
with unroll factors of one, two, four, and eight, respectively.


Algorithm                   Accuracy
---------                   --------
Nearest Neighbors           0.88
Always predicting unroll    0.77
ORC’s decision              0.72


Table 3: Accuracy of binary predictors for loop un- 
rolling. This table compares the accuracy of three dif- 
ferent predictors: nearest neighbor classification, always 
predicting unroll, and ORC’s prediction. 


Table 3 summarizes the results. Nearest neighbors pre- 
dicts with 88% accuracy whether all unroll factors are better 
than not unrolling. Thus, regardless of what unroll factor 
ORC chooses, unrolling will be beneficial for positive ex- 
amples. Note that if we were to always predict unroll, we 
would only be right 77% of the time for the 1138 loops in 
our dataset. ORC predicts correctly 72% of the time. 
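
The binary label and the trivial baseline used in this comparison can be sketched as follows. The function names and the data layout are assumptions; the labeling rule follows the description above, in which a loop is a positive example only when every unroll factor beats the rolled loop.

```python
def should_unroll(runtimes):
    """True when every unroll factor above one is faster than factor one.

    `runtimes` maps an unroll factor to its measured cycle count; factor 1
    is the rolled loop.
    """
    rolled = runtimes[1]
    return all(t < rolled for factor, t in runtimes.items() if factor != 1)


def always_unroll_accuracy(labels):
    """Accuracy of the predictor that answers 'unroll' for every loop."""
    return sum(labels) / len(labels)
```

A classifier is only useful here if it beats `always_unroll_accuracy`, which is why Table 3 reports that baseline.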

If instead we try to predict whether unrolling using ORC’s 
unroll factor is beneficial, nearest neighbors correctly pre- 
dicts 89% of the examples. This marginal increase in accu- 
racy is not surprising since the vast majority of the time, 
not unrolling is either among the best decisions or the worst 
decisions (83% of the time it is either the best decision or the 
worst decision). Thus, whatever unroll factor ORC chooses 
will probably be better than not unrolling when some unroll 
factors are beneficial; the converse is also true. 


6.1 The Best Four Features 


The NN algorithm can be trained extremely quickly, as it 
only involves populating the database of examples. It also 
classifies examples quickly for small databases. We therefore 
experimented with exhaustively searching for the most
informative four features for NN classification. In this capacity,
machine learning techniques can help engineers identify the
most influential aspects of their systems.
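
Such an exhaustive search can be sketched as follows: every subset of the chosen size is scored by its leave-one-out accuracy and the best one is kept. With 38 features and subsets of size four there are 73,815 candidates. The sketch assumes already-normalized feature vectors, and its function names are hypothetical.

```python
import math
from itertools import combinations


def loocv_accuracy(examples, columns):
    """Leave-one-out 1-NN accuracy using only the given feature columns."""
    projected = [([x[c] for c in columns], y) for x, y in examples]
    correct = 0
    for i, (x_i, y_i) in enumerate(projected):
        rest = projected[:i] + projected[i + 1:]
        _, label = min(rest, key=lambda ex: math.dist(ex[0], x_i))
        correct += label == y_i
    return correct / len(projected)


def best_feature_subset(examples, subset_size):
    """Index tuple of the feature subset with the highest LOOCV accuracy."""
    n_features = len(examples[0][0])
    return max(combinations(range(n_features), subset_size),
               key=lambda cols: loocv_accuracy(examples, cols))
```

The cost is one LOOCV pass per subset, which is affordable only because nearest neighbors needs no real training step.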

With four features, the classifier can predict with 85% 
accuracy when to unroll. The four best features are sum- 
marized here: 


• The number of operations in the loop body: It is not
  surprising that this is one of the best features. Large
  loop bodies will not likely expose any exploitable
  intra-iteration parallelism and will significantly increase
  register pressure when unrolled.

• The maximum critical path height: Unrolling loops
  with long critical paths will not significantly expose
  ILP because such computations are sequential.

• Minimum memory to memory loop-carried dependency:
  If a memory access is dependent on a memory access
  from the previous iteration, this will hinder code
  motion opportunities when the loop is unrolled. Thus, it
  may make sense to leave the loop rolled.

• The number of indirect memory references: The
  following example from bzip2 illustrates why this feature
  might be valuable:

      rfreq[bt][szptr[i]]++

  When the loop is unrolled, the indices into the rfreq
  array can be computed in parallel at the top of the
  loop. By loading the indices early, the loop’s runtime
  can be drastically reduced. On the other hand, ORC
  and Itanium support data speculation; memory aliasing
  ambiguities introduced by indirect memory references
  may cause ORC to insert speculative memory
  operations that can impair performance in many cases.


7. MULTI-CLASS CLASSIFICATION 


While knowing whether a loop should be unrolled is help- 
ful, knowing the optimal unroll factor is even better. In this 
section we describe the operation of a multi-class classifier 
for loop unrolling. Thus, instead of trying to classify a loop 
into one of two categories, we will now use eight categories, 
corresponding to unroll factors one through eight. Recall 
that an unroll factor of one leaves the loop rolled. 

As with the two-class case, we first collect the amount of 
time it takes for each unroll factor to execute each unrollable 
loop in our suite of benchmarks. The unroll factor that 
requires the fewest number of cycles to execute a given loop 
is the label for that loop. 

NN is used for multi-class classification in the same man- 
ner as described above for the two-class case. We train the 
NN algorithm the same way we did for the two-class case, 
except now the predicted unroll factor for a novel loop will 
be the unroll factor for the loop to which it is nearest. 

Table 4 shows the accuracy of the learning algorithm and 
ORC’s heuristic. Using leave-one-out cross validation we 
find that 60% of the time the NN algorithm finds the opti- 
mal unroll factor. A further 14% of the time it chooses the 
nearly-optimal solution. The rightmost column in the table 
shows the cost associated with mispredicting. We can infer 
from the table that a full 74% of the time, NN classification 
is within 6% of the optimal performance. 
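
The two quantities reported in Table 4, how a prediction ranks among the measured unroll factors and what it costs relative to the optimum, can be computed per loop as in the sketch below. The function name and data layout are assumptions.

```python
def rank_and_cost(runtimes, predicted):
    """Return (rank, cost) of a predicted unroll factor for one loop.

    Rank 1 means optimal, 2 second best, and so on. Cost is the runtime
    of the predicted factor divided by the optimal runtime (1.0 when the
    prediction is optimal).
    """
    ordered = sorted(runtimes, key=lambda factor: runtimes[factor])
    rank = ordered.index(predicted) + 1
    cost = runtimes[predicted] / runtimes[ordered[0]]
    return rank, cost
```

Averaging the cost over all loops whose prediction has a given rank yields one entry of the Cost column.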

The histogram in Figure 3 shows the distribution of opti- 
mal unroll factors. An interesting observation is that non- 
power of two unroll factors are rarely optimal for this data 


1x 


Optimal unroll factor 
Second-best unroll factor 
Third-best unroll factor 
Fourth-best unroll factor 
Fifth-best unroll factor 


Sixth-best unroll factor 
Seventh-best unroll factor 
Worst unroll factor 


Table 4: Accuracy of predictions for the nearest neigh- 
bors algorithm and ORC’s heuristic. This table shows 
the percentage of the predictions that each algorithm 
made that were optimal. In addition, the table shows 
the percentage of predictions made by each algorithm 
that were Nth best. Nearest neighbors predicts the op- 
timal or nearly-optimal unroll factor 74% of the time. 
The Cost column shows the runtime penalty for mispre- 
dicting (as compared to the optimal factor). 

Figure 3: Histogram of optimal unroll factors. This
figure shows the percentage of loops for which the given
unroll factor is optimal. The histogram was constructed
from 1138 loops spanning several benchmark suites.


set. The figure also indicates that no one loop unrolling 
factor is dominantly better than the others. 


7.1 Realizing Speedups 


In this section we see if improved unrolling classification 
accuracy yields program speedups. For these experiments, 
we compile each of the benchmarks in Table 2 using the NN 
algorithm to compute an unroll factor for each loop. We 
do not instrument the compiled code for the experiments 
in this section; we simply use the UNIX time command 
and the median of three trials to measure whole-program 
runtimes. Similar to LOOCV, when compiling a benchmark, 
we exclude all examples from that benchmark in the NN 
database. In this way we see how well the learned compiler 
algorithm performs on loops that it has not seen before. 
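The two pieces of this protocol, taking the median of three timed runs and withholding a benchmark's own loops from the NN database, can be sketched as below. The record layout and function names are hypothetical, and the timings are made up.

```python
import statistics


def median_runtime(trial_seconds):
    """Median wall-clock time over repeated runs (three in the text)."""
    return statistics.median(trial_seconds)


def database_for(benchmark, examples):
    """Drop every training example that came from the benchmark being compiled."""
    return [ex for ex in examples if ex["benchmark"] != benchmark]


# Hypothetical records: one entry per labeled loop.
examples = [
    {"benchmark": "swim", "features": (0.2, 0.7), "label": 4},
    {"benchmark": "swim", "features": (0.3, 0.6), "label": 2},
    {"benchmark": "mgrid", "features": (0.9, 0.1), "label": 8},
]
print(len(database_for("swim", examples)))
print(median_runtime([12.4, 11.9, 12.1]))
```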

Figure 4 shows the performance improvement of the NN
algorithm over ORC’s unrolling heuristic. The figure also
shows the speedup that the compiler could obtain if an oracle
were to make its unrolling decisions. NN achieves a speedup
for most of the SPECs that we include: we achieve speedups
on 20 of the 27 SPEC benchmarks. Overall, our technique
attains a 6% speedup on the SPECs (5% using the geometric
mean).

Figure 4: Realized performance on the SPEC benchmarks. We
attain speedups on 20 of the 27 benchmarks in this graph, and
a 6% speedup overall (5% using the geometric mean). The
rightmost bar for each benchmark shows the speedup that a
perfect classifier would attain.
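The overall speedup is quoted twice, once as a plain average and once using the geometric mean. The sketch below shows how the two summaries differ on per-benchmark speedup ratios. The ratios are invented for illustration, and treating the first figure as an arithmetic mean of ratios is an assumption; the text does not spell out how it was computed.

```python
import math


def arithmetic_mean(ratios):
    return sum(ratios) / len(ratios)


def geometric_mean(ratios):
    # Average in log space, then map back; requires strictly positive ratios.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Made-up speedup ratios (new runtime baseline / new runtime); 1.0 means no change.
ratios = [1.30, 1.02, 0.95, 1.01]
print(arithmetic_mean(ratios), geometric_mean(ratios))
```

The geometric mean is never larger than the arithmetic mean, so a few large wins pull it up less, which is consistent with the geometric figure being the smaller of the two numbers reported.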

Figure 5 shows the performance of the remaining benchmarks
in our training set. We achieve speedups on nearly all of
these benchmarks. However, our predictor badly mispredicted
key loops in the nas and mmmul benchmarks. For mmmul, the
key loop should have been unrolled by a factor of three, but
NN instead chose an unroll factor of eight. Perhaps one reason
for the confusion is that an unroll factor of three is optimal
only 3% of the time, so there are relatively few examples with
this unroll factor in the database.


7.2 The Best Four Features 


We also exhaustively found the best four features for dis- 
criminating between unroll factors. Together, the features 
below allow a nearest neighbor classifier to correctly classify 
55% of the examples in the training set: 


• The number of operations in the loop body is again
one of the most important discriminating factors.

• The number of predicated operations in the loop body:
This feature helps discriminate between loops with
conditional control flow and simple, control-independent
loops. The presence of control flow may restrict the
compiler’s code motion opportunities and render
unrolling useless.

• The source code language: One possible explanation
that this is a key feature is that Fortran code is easier
to analyze than C code; the fact that C arrays can alias
forces the compiler to make conservative assumptions
about memory accesses, and thus some optimizations
cannot be performed (or alternatively, speculative
instructions must be used). Another distinct possibility
is that the problems people choose to implement
in Fortran are inherently more amenable to loop
unrolling. Fortran has long been the language of choice
for scientific computing, applications of which are
typically highly parallel.

• The number of definitions that reach the loop entrance:
That this feature is included is no surprise. One of the
primary drawbacks of loop unrolling is the additional
register pressure that it creates. This characteristic
serves as a gauge for measuring the register pressure
surrounding the loop body.
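An exhaustive search of this kind can be sketched as follows: score every subset of k features by how well a nearest neighbor classifier restricted to those features predicts held-out examples, and keep the best subset. Scoring by leave-one-out accuracy, the Euclidean distance, and all names and data here are assumptions for illustration; the paper does not describe its search at this level of detail.

```python
import math
from itertools import combinations


def loocv_accuracy(examples, cols):
    """Leave-one-out 1-NN accuracy using only the feature indices in cols."""
    hits = 0
    for i, (x, y) in enumerate(examples):
        query = [x[c] for c in cols]
        others = (ex for j, ex in enumerate(examples) if j != i)
        nearest = min(
            others,
            key=lambda ex: math.dist([ex[0][c] for c in cols], query),
        )
        hits += nearest[1] == y
    return hits / len(examples)


def best_subset(examples, k):
    """Try every k-feature subset and return the most accurate one."""
    n_features = len(examples[0][0])
    return max(
        combinations(range(n_features), k),
        key=lambda cols: loocv_accuracy(examples, cols),
    )


# Toy data: feature 0 separates the labels, feature 1 is misleading.
examples = [
    ((0.0, 5.0), 1), ((0.1, -5.0), 1),
    ((1.0, 5.1), 8), ((1.1, -5.1), 8),
]
print(best_subset(examples, 1))
```

The number of subsets grows combinatorially with the size of the feature set, so a search like this is practical only because each nearest neighbor evaluation is cheap.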


[Figure 5 plot. Vertical axis: Improvement (%). Bars for each
benchmark: NN v. ORC and Oracle v. ORC.]

Figure 5: Realized performance on miscellaneous benchmarks. We achieve speedups on 16 of the 19 kernels and 
synthetic benchmarks in this figure. Overall we attain a 7% speedup on these benchmarks (4% using the geometric 
mean). The rightmost bar for each benchmark shows the best speedup that could possibly be attained. 


8. RELATED WORK 


This section discusses relevant related work. Because our 
research focuses on applying learning techniques to compi- 
lation, we emphasize related work in this area. 

Monsifrot et al. use a classifier based on “Boosted”
decision tree learning to determine which loops to unroll [9].
While the methodology we present in this paper is similar to
theirs, our work differs in several important ways. Most
importantly, our experiments employ multi-class classification
to determine the optimal unroll factor. They only consider
the two-class classification problem presented in Section 6,
leaving the choice of unroll factor up to a compiler
heuristic. Another major difference is that they unroll loops
before compiling, not in the backend as we do. When trying
to decide whether loop unrolling should be performed using
ORC’s unroll factor, we arrive at roughly the same
classification accuracy (our 88% compared to their 85%). These
numbers may not be comparable, however, because we use
two completely different compiler infrastructures and
learning algorithms.

Calder et al. used supervised learning techniques to fine- 
tune static branch prediction heuristics [1]. They employ 
neural networks and decision trees to search for effective 
static branch prediction heuristics. While their technique 
is effective, branch prediction is a binary problem that is 
simpler than the multi-class problem this paper considers. 
Finally, their problem has the benefit that instrumentation
code to determine branch direction will not affect the
direction to which branches are resolved. They were therefore
able to work with a noiseless dataset. We must deal with
noisy datasets; we measure execution time, but the
instrumentation counters we insert have an effect on the
measurement.

Stephenson et al. use genetic programming to fine-tune 
compiler priority functions [12]. Their unsupervised tech- 
niques seem to work well for the problems they studied, but 
supervised learning of the form presented in this paper is far 
more efficient whenever a training data set can be created. 
Their technique requires weeks to train, while most super- 
vised learning algorithms require minutes or seconds. In 
addition, genetic programming is extremely unstable, with 
back-to-back runs yielding completely different results. 

Cooper et al. use genetic algorithms to solve compilation 
phase ordering problems [4]. Their technique is well suited 
to the task, but cannot be extended to handle other compiler 
problems. 

Several compiler researchers have created model-based sys- 
tems to automatically compute unroll factors [11, 3, 2]. In 
particular Sarkar used in-depth, hand-made system models 
to create a cost function that ranks unroll factors accord- 
ing to estimated performance improvement. His technique 
improved a highly optimized, industry-strength compiler by 
8% on seven of the SPEC95fp benchmarks. While this result 
is impressive, the models developed in [11] are much more 
complicated than the nearest neighbors approach we employ. 
While our test infrastructures are different (and probably
not comparable), it is worth noting that we achieved a 9%
improvement on the SPECfp benchmarks in our training set
(8% using the geometric mean).


9. CONCLUSION 


Compiler developers have always had to contend with
complex phenomena that are not easily modeled. For example,
it has never been possible to create a useful model for all
the input programs the compiler has to optimize. Until
recently, however, most architectures (the target of compiler
optimizations) were simple and analyzable. This is no longer
the case. A complex compiler with multiple interdependent
optimization passes exacerbates the problem. In many
instances, end-to-end performance can only be evaluated
empirically.

To that end, this paper uses empirical evidence to teach a
simple machine-learning algorithm how to make informed loop
unrolling decisions. By using a large database of empirical
loop observations, our technique classifies loop unrolling
factors with great precision. Using leave-one-out
cross-validation to find the generalization ability of the
classifier (i.e., how well it performs on examples that are
not in the training set), the algorithm is able to determine
with 88% accuracy when loop unrolling should be performed.
We show that nearest neighbors can also be used to predict
the optimal unroll factor for a given loop: a full 74% of the
time it predicts the optimal, or the nearly optimal, solution.

We also translate these results into speedups on a real 
machine. Using a well-respected compiler and targeting the 
Itanium 2 architecture, we find that the nearest neighbors 
algorithm improves the performance of 36 of the 46 bench- 
marks in our suite. We achieve a 6% improvement on the 
SPEC benchmarks, and a 7% improvement on miscellaneous 
small benchmarks and kernels. 

While we are pleased with the performance improvements, 
we are more excited about the complexity ramifications of 
our research. Apart from extracting features that we think 
might be pertinent, we thought little about designing an 
unrolling heuristic. Furthermore, we were able to use the 
learning algorithm to exhaustively reduce our large feature 
set to the four most important characteristics for loop un- 
rolling. 

Compiler writers are forced to spend a large portion of 
their time designing heuristics. The results presented in this 
paper lead us to believe that machine-learning techniques 
can create certain heuristics at least as well as human de- 
signers. We hope that automatic heuristic tuning based on 
empirical evaluation will become prevalent, and that design- 
ers will intentionally expose algorithm policies to facilitate 
machine-learning optimization. 